The following list provides one potential structure of the data analysis report. As this is the final project, the following suggestions are intended to provide one viable route for your project while leaving you as much freedom as possible.
Before writing your analysis report, you may want to explore this data set and read about the coronavirus to generate the hypothesis or question to be answered in this report, i.e., the question(s) of interest. You can be creative on this question so long as it meets three conditions.
In this Exploratory Data Analysis of the World Health Organization (WHO) COVID-19 database, we determine the impact of COVID-19 on global economic health. We consider many factors in this analysis such as unemployment rates, energy prices, Gross Domestic Product (GDP), and other such criteria to get a sense of how economies progress as the disease travels through the population.
In early 2020, an outbreak of the coronavirus disease known as COVID-19 spread from China and affected people all over the world. The initial outbreak led many countries to go into lockdown, requiring citizens to stay at home in quarantine to contain the spread of the virus. As a result, countries produced fewer goods and GDP fell. The purpose of this analysis is to visualize how global economies responded to COVID-19 over time. Furthermore, if such a pandemic occurs in the future, we may be able to predict how each country will be affected and propose a response under which the economy suffers less.
The WHO provides a COVID-19 dataset, updated daily, that reports the number of COVID-19 cases and deaths in each country. Here is the initial data:
library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Date_reported = col_date(format = ""),
## Country_code = col_character(),
## Country = col_character(),
## WHO_region = col_character(),
## New_cases = col_double(),
## Cumulative_cases = col_double(),
## New_deaths = col_double(),
## Cumulative_deaths = col_double()
## )
head(covid)
## # A tibble: 6 x 8
## Date_reported Country_code Country WHO_region New_cases Cumulative_cases
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-01-03 AF Afghan~ EMRO 0 0
## 2 2020-01-04 AF Afghan~ EMRO 0 0
## 3 2020-01-05 AF Afghan~ EMRO 0 0
## 4 2020-01-06 AF Afghan~ EMRO 0 0
## 5 2020-01-07 AF Afghan~ EMRO 0 0
## 6 2020-01-08 AF Afghan~ EMRO 0 0
## # ... with 2 more variables: New_deaths <dbl>, Cumulative_deaths <dbl>
At the time of writing, this dataset contains 101,436 observations of 8 variables: Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, and Cumulative_deaths. The first four variables identify each observation by date and location (Date_reported is a date; the other three are categorical) and are used to group days and regions together. The last four variables are what we are measuring: the number of cases and deaths in each country.
We first analyze the WHO dataset to get a sense of the trends per fiscal quarter, which we define as a three-month period, with the first quarter starting on January 3, 2020 (so each subsequent quarter starts three months later, also on the 3rd). We keep only the dates from the past year, that is, the dates between 1/3/2020 and 12/31/2020. We then replace each value of Date_reported with the start date of the fiscal quarter it falls in. The resulting data set is shown below:
head(covid)
## # A tibble: 6 x 8
## Date_reported Country_code Country WHO_region New_cases Cumulative_cases
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1/3/2020 AR Argent~ AMRO 0 0
## 2 1/3/2020 AR Argent~ AMRO 0 0
## 3 1/3/2020 AR Argent~ AMRO 0 0
## 4 1/3/2020 AR Argent~ AMRO 0 0
## 5 1/3/2020 AR Argent~ AMRO 0 0
## 6 1/3/2020 AR Argent~ AMRO 0 0
## # ... with 2 more variables: New_deaths <dbl>, Cumulative_deaths <dbl>
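The filtering and relabeling described above is not shown in the report, so the following is only a minimal sketch of one way it might be done. The toy data frame, its two placeholder countries, and the use of findInterval() are assumptions for illustration. The sketch also keeps Date_reported as a Date rather than the text form shown in the output above, so that quarters sort chronologically.

```r
library(dplyr)

# Toy stand-in for the WHO table: two hypothetical countries with daily rows.
dates <- seq(as.Date("2020-01-03"), as.Date("2021-01-10"), by = "day")
toy <- data.frame(
  Date_reported = rep(dates, times = 2),
  Country = rep(c("A", "B"), each = length(dates)),
  New_cases = 1
)

# Quarter start dates as defined in the text (the 3rd of Jan, Apr, Jul, Oct).
quarter_starts <- as.Date(c("2020-01-03", "2020-04-03",
                            "2020-07-03", "2020-10-03"))

toy_q <- toy %>%
  # Keep only the dates between 1/3/2020 and 12/31/2020.
  filter(Date_reported >= as.Date("2020-01-03"),
         Date_reported <= as.Date("2020-12-31")) %>%
  # findInterval() gives the index of the last quarter start on or before
  # each date; indexing quarter_starts with it yields the quarter label.
  mutate(Date_reported = quarter_starts[
    findInterval(as.numeric(Date_reported), as.numeric(quarter_starts))
  ])

# Number of daily rows that fall in each country-quarter.
count(toy_q, Country, Date_reported)
```

Storing the label as a Date avoids the lexicographic ordering that text labels would have, where "10/3/2020" sorts before "4/3/2020".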
We wish to see the average number of COVID-19 cases and deaths per day in each quarter, so we group by country and quarter. We are also interested in the percent change in cases and deaths relative to the preceding quarter. The following code demonstrates this process.
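# Absolute relative change from the previous quarter (a proportion, not multiplied by 100).
# Assumes 51 countries with 4 quarterly rows each, ordered by country and then chronologically.
# A previous-quarter average of 0 yields Inf or NaN.
Ave_Summary$CasePercent_Change = NA
Ave_Summary$DeathPercent_Change = NA
for (i in 0:50) {      # block of 4 rows for one country
  for (j in 1:3) {     # previous quarter within the block
    prev = 4*i + j
    curr = prev + 1
    Ave_Summary$CasePercent_Change[curr] = abs(Ave_Summary$Ave_NewCase[prev] - Ave_Summary$Ave_NewCase[curr]) / Ave_Summary$Ave_NewCase[prev]
    Ave_Summary$DeathPercent_Change[curr] = abs(Ave_Summary$Ave_NewDeath[prev] - Ave_Summary$Ave_NewDeath[curr]) / Ave_Summary$Ave_NewDeath[prev]
  }
}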
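The loop above relies on Ave_Summary already existing, with exactly four rows per country in chronological order. The construction of Ave_Summary is not shown in the report, so the following is a hedged sketch only. It assumes that Ave_NewCase and Ave_NewDeath are means of the daily New_cases and New_deaths, and it uses hypothetical toy data. It also replaces the index arithmetic with dplyr::lag() inside each country group, which gives the same absolute relative change without depending on a fixed number of rows per country.

```r
library(dplyr)

# Toy quarter-labelled data: two hypothetical countries, two days per quarter.
quarters <- as.Date(c("2020-01-03", "2020-04-03", "2020-07-03", "2020-10-03"))
toy <- data.frame(
  Country       = rep(c("A", "B"), each = 8),
  Date_reported = rep(rep(quarters, each = 2), times = 2),
  New_cases     = c(1, 3, 4, 4, 2, 2, 8, 8,  0, 0, 5, 5, 10, 10, 5, 5),
  New_deaths    = c(0, 2, 1, 1, 1, 1, 3, 3,  0, 0, 1, 1,  2,  2, 1, 1)
)

Ave_Summary <- toy %>%
  # Average daily cases and deaths per country and quarter.
  group_by(Country, Date_reported) %>%
  summarize(Ave_NewCase  = mean(New_cases),
            Ave_NewDeath = mean(New_deaths),
            .groups = "drop") %>%
  # Order chronologically within each country before lagging.
  arrange(Country, Date_reported) %>%
  group_by(Country) %>%
  mutate(
    # |previous - current| / previous; the first quarter of each country is NA.
    CasePercent_Change  = abs(dplyr::lag(Ave_NewCase) - Ave_NewCase) /
                          dplyr::lag(Ave_NewCase),
    DeathPercent_Change = abs(dplyr::lag(Ave_NewDeath) - Ave_NewDeath) /
                          dplyr::lag(Ave_NewDeath)
  ) %>%
  ungroup()

Ave_Summary
```

As in the loop, a previous-quarter average of zero produces Inf (country "B" in the toy data), which a real analysis would need to handle before plotting.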
We now plot these quantities as time series. Note that the raw numbers of deaths and cases are shown on a (natural) logarithmic scale.
library(ggpubr)  # provides ggarrange()

# Date_reported is stored as "m/d/Y" text, so the format is given explicitly;
# group = Country draws one line per country.
figure1 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = log(Ave_NewDeath), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure2 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = log(Ave_NewCase), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure3 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = CasePercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure4 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = DeathPercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
ggarrange(figure1, figure2, figure3, figure4)
To better visualize the progression of the disease in these countries, we provide an interactive plot with a slider that steps through the quarters, showing the average number of new deaths against the average number of new cases (raw values).
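library(plotly)  # provides plot_ly()

Ave_Summary %>% plot_ly(
  x = ~Ave_NewCase,
  y = ~Ave_NewDeath,
  frame = ~Date_reported,   # one animation frame per quarter
  text = ~Country,
  hoverinfo = "text",       # show the country name on hover
  color = ~Country,
  type = 'scatter',
  mode = 'markers',
  showlegend = TRUE
)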
Earlier we showed the graphs for the average number of cases and deaths. Another way to visualize the data is to view the total number of cases and deaths accumulated within each fiscal quarter.
Cumul_Summary = covid %>% group_by(Country, Date_reported) %>%
summarize(Cumul_Case = sum(New_cases),
Cumul_Death = sum(New_deaths)
)
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
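# Absolute relative change from the previous quarter (a proportion, not multiplied by 100).
# Assumes 51 countries with 4 quarterly rows each, ordered by country and then chronologically.
# A previous-quarter total of 0 yields Inf or NaN.
Cumul_Summary$CasePercent_Change = NA
Cumul_Summary$DeathPercent_Change = NA
for (i in 0:50) {      # block of 4 rows for one country
  for (j in 1:3) {     # previous quarter within the block
    prev = 4*i + j
    curr = prev + 1
    Cumul_Summary$CasePercent_Change[curr] = abs(Cumul_Summary$Cumul_Case[prev] - Cumul_Summary$Cumul_Case[curr]) / Cumul_Summary$Cumul_Case[prev]
    Cumul_Summary$DeathPercent_Change[curr] = abs(Cumul_Summary$Cumul_Death[prev] - Cumul_Summary$Cumul_Death[curr]) / Cumul_Summary$Cumul_Death[prev]
  }
}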
Now we plot the time series similar to earlier.
# Date_reported is stored as "m/d/Y" text, so the format is given explicitly;
# group = Country draws one line per country.
figure5 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = log(Cumul_Case), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
# Deaths are plotted on the raw scale here (no log).
figure6 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = Cumul_Death, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure7 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = CasePercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure8 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported, format = "%m/%d/%Y"), y = DeathPercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
ggarrange(figure5, figure6, figure7, figure8)
Below is an interactive plot like the one above, this time showing quarterly total deaths against quarterly total cases.
Cumul_Summary %>%
  plot_ly(
    x = ~Cumul_Case,
    y = ~Cumul_Death,
    frame = ~Date_reported,   # one animation frame per quarter
    text = ~Country,
    hoverinfo = "text",       # show the country name on hover
    color = ~Country,
    type = 'scatter',
    mode = 'markers',
    showlegend = TRUE
  )
Notice that the time series for average cases and cumulative cases have similar shapes, and the same holds for average deaths and cumulative deaths. This suggests a direct relationship between the average daily number of COVID-19 cases/deaths and the total for the fiscal quarter, which further analysis could confirm.
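One possible explanation for this similarity, assuming the quarterly averages are means of the same daily counts that are summed into the quarterly totals: a total equals the mean multiplied by the number of days, so on a logarithmic scale the two differ only by the constant log(n). The short sketch below checks this on simulated daily counts, which are hypothetical and not taken from the WHO data.

```r
# Hypothetical daily case counts for one country over a 91-day quarter.
set.seed(1)
daily <- rpois(91, lambda = 50)

quarter_total <- sum(daily)   # analogue of Cumul_Case
quarter_mean  <- mean(daily)  # analogue of Ave_NewCase

# total = n * mean, so the two logs differ by the constant log(n).
log_gap <- log(quarter_total) - log(quarter_mean)
log_gap
log(length(daily))
```

Because quarters contain nearly the same number of days, this shift is almost identical in every quarter, which would make the log-scale curves close to vertical translations of each other.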
Another way to visualize this data is given by boxplots using a logarithmic scale. We display the data below:
figure9 = ggplot(Ave_Summary, aes(x = Date_reported, y = log(Ave_NewCase))) + geom_boxplot()
figure10 = ggplot(Ave_Summary, aes(x = Date_reported, y = log(Ave_NewDeath))) + geom_boxplot()
figure11 = ggplot(Cumul_Summary, aes(x = Date_reported, y = log(Cumul_Case))) + geom_boxplot()
figure12 = ggplot(Cumul_Summary, aes(x = Date_reported, y = log(Cumul_Death))) + geom_boxplot()
ggarrange(figure9, figure10, figure11, figure12)
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
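The warnings above are most likely caused by country-quarters with a value of zero, since log(0) is -Inf and ggplot2 drops non-finite values before drawing a boxplot (a negative value would give NaN and be dropped in the same way). As a hedged illustration with hypothetical totals, the sketch below contrasts log() with log1p(), which computes log(1 + x) and keeps zeros on the plot. Whether that transformation suits this analysis is a judgment call, not something the report states.

```r
# Hypothetical quarterly death totals, including one quarter with zero deaths.
Cumul_Death <- c(0, 12, 250, 1800)

plain_log   <- log(Cumul_Death)    # first element is -Inf
shifted_log <- log1p(Cumul_Death)  # log(1 + x); zero maps to 0

# Count the values that a boxplot would have to drop under each transformation.
sum(!is.finite(plain_log))
sum(!is.finite(shifted_log))
```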
Propose an appropriate model to answer the questions of interest.
Fit the proposed model in (4) and explain your results.
Conduct model diagnostics and/or sensitivity analysis.
Conclude your analysis with a discussion of your findings and caveats of your approach.